Project-3: CUSTOMER SEGMENTATION FOR MARKETING CAMPAIGN

CUSTOMER SEGMENTATION

This project notebook is about customer clustering for an ad marketing campaign. I used several machine learning techniques: k-means clustering, plus Principal Component Analysis (PCA) and autoencoders to perform dimensionality reduction.

UNDERSTANDING THE PROBLEM STATEMENT AND BUSINESS CASE


DATASET AND FEATURES


  • Note: MSRP stands for manufacturer's suggested retail price, or sticker price. It represents the suggested retail price of a product and is used to standardize prices across multiple store locations.

WALKTHROUGH

In [89]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import zipfile
#import cv2
import plotly.express as px
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers, optimizers
from tensorflow.keras.applications.resnet50 import ResNet50
from tensorflow.keras.layers import Input, Add, Dense, Activation, ZeroPadding2D, BatchNormalization, Flatten, Conv2D, AveragePooling2D, MaxPooling2D, Dropout
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.initializers import glorot_uniform
from tensorflow.keras.utils import plot_model
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping, ModelCheckpoint, LearningRateScheduler
from IPython.display import display
from tensorflow.keras import backend as K
from sklearn.preprocessing import StandardScaler, normalize
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity
import plotly.offline as py
import plotly.graph_objects as go

#from google.colab import files #library to upload files to colab notebook
%matplotlib inline
In [90]:
#loading my dataset
sales_df=pd.read_csv('sales_data_sample.csv', encoding='unicode_escape')
In [101]:
sales_df
Out[101]:
ORDERNUMBER QUANTITYORDERED PRICEEACH ORDERLINENUMBER SALES ORDERDATE STATUS QTR_ID MONTH_ID YEAR_ID ... ADDRESSLINE1 ADDRESSLINE2 CITY STATE POSTALCODE COUNTRY TERRITORY CONTACTLASTNAME CONTACTFIRSTNAME DEALSIZE
0 10107 30 95.70 2 2871.00 2/24/2003 0:00 Shipped 1 2 2003 ... 897 Long Airport Avenue NaN NYC NY 10022 USA NaN Yu Kwai Small
1 10121 34 81.35 5 2765.90 5/7/2003 0:00 Shipped 2 5 2003 ... 59 rue de l'Abbaye NaN Reims NaN 51100 France EMEA Henriot Paul Small
2 10134 41 94.74 2 3884.34 7/1/2003 0:00 Shipped 3 7 2003 ... 27 rue du Colonel Pierre Avia NaN Paris NaN 75508 France EMEA Da Cunha Daniel Medium
3 10145 45 83.26 6 3746.70 8/25/2003 0:00 Shipped 3 8 2003 ... 78934 Hillside Dr. NaN Pasadena CA 90003 USA NaN Young Julie Medium
4 10159 49 100.00 14 5205.27 10/10/2003 0:00 Shipped 4 10 2003 ... 7734 Strong St. NaN San Francisco CA NaN USA NaN Brown Julie Medium
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2818 10350 20 100.00 15 2244.40 12/2/2004 0:00 Shipped 4 12 2004 ... C/ Moralzarzal, 86 NaN Madrid NaN 28034 Spain EMEA Freyre Diego Small
2819 10373 29 100.00 1 3978.51 1/31/2005 0:00 Shipped 1 1 2005 ... Torikatu 38 NaN Oulu NaN 90110 Finland EMEA Koskitalo Pirkko Medium
2820 10386 43 100.00 4 5417.57 3/1/2005 0:00 Resolved 1 3 2005 ... C/ Moralzarzal, 86 NaN Madrid NaN 28034 Spain EMEA Freyre Diego Medium
2821 10397 34 62.24 1 2116.16 3/28/2005 0:00 Shipped 1 3 2005 ... 1 rue Alsace-Lorraine NaN Toulouse NaN 31000 France EMEA Roulet Annette Small
2822 10414 47 65.52 9 3079.44 5/6/2005 0:00 On Hold 2 5 2005 ... 8616 Spinnaker Dr. NaN Boston MA 51003 USA NaN Yoshido Juri Medium

2823 rows × 25 columns

In [15]:
sales_df.dtypes
Out[15]:
ORDERNUMBER           int64
QUANTITYORDERED       int64
PRICEEACH           float64
ORDERLINENUMBER       int64
SALES               float64
ORDERDATE            object
STATUS               object
QTR_ID                int64
MONTH_ID              int64
YEAR_ID               int64
PRODUCTLINE          object
MSRP                  int64
PRODUCTCODE          object
CUSTOMERNAME         object
PHONE                object
ADDRESSLINE1         object
ADDRESSLINE2         object
CITY                 object
STATE                object
POSTALCODE           object
COUNTRY              object
TERRITORY            object
CONTACTLASTNAME      object
CONTACTFIRSTNAME     object
DEALSIZE             object
dtype: object
In [8]:
sales_df['ORDERDATE']
Out[8]:
0        2/24/2003 0:00
1         5/7/2003 0:00
2         7/1/2003 0:00
3        8/25/2003 0:00
4       10/10/2003 0:00
             ...
2818     12/2/2004 0:00
2819     1/31/2005 0:00
2820      3/1/2005 0:00
2821     3/28/2005 0:00
2822      5/6/2005 0:00
Name: ORDERDATE, Length: 2823, dtype: object
In [91]:
#convert ORDERDATE into datetime pandas format 
sales_df['ORDERDATE']=pd.to_datetime(sales_df['ORDERDATE'])
In [92]:
#Null elements in 'ADDRESSLINE2'
pd.isna(sales_df['ADDRESSLINE2']).sum()
Out[92]:
2521

Information concerning the types and number of non-null elements for each variable

In [8]:
sales_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2823 entries, 0 to 2822
Data columns (total 25 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   ORDERNUMBER       2823 non-null   int64
 1   QUANTITYORDERED   2823 non-null   int64
 2   PRICEEACH         2823 non-null   float64
 3   ORDERLINENUMBER   2823 non-null   int64
 4   SALES             2823 non-null   float64
 5   ORDERDATE         2823 non-null   datetime64[ns]
 6   STATUS            2823 non-null   object
 7   QTR_ID            2823 non-null   int64
 8   MONTH_ID          2823 non-null   int64
 9   YEAR_ID           2823 non-null   int64
 10  PRODUCTLINE       2823 non-null   object
 11  MSRP              2823 non-null   int64
 12  PRODUCTCODE       2823 non-null   object
 13  CUSTOMERNAME      2823 non-null   object
 14  PHONE             2823 non-null   object
 15  ADDRESSLINE1      2823 non-null   object
 16  ADDRESSLINE2      302 non-null    object
 17  CITY              2823 non-null   object
 18  STATE             1337 non-null   object
 19  POSTALCODE        2747 non-null   object
 20  COUNTRY           2823 non-null   object
 21  TERRITORY         1749 non-null   object
 22  CONTACTLASTNAME   2823 non-null   object
 23  CONTACTFIRSTNAME  2823 non-null   object
 24  DEALSIZE          2823 non-null   object
dtypes: datetime64[ns](1), float64(2), int64(7), object(15)
memory usage: 551.5+ KB

Number of null elements for each variable in the dataset

In [7]:
sales_df.isna().sum() #we can also use 'isnull().sum()'
Out[7]:
ORDERNUMBER            0
QUANTITYORDERED        0
PRICEEACH              0
ORDERLINENUMBER        0
SALES                  0
ORDERDATE              0
STATUS                 0
QTR_ID                 0
MONTH_ID               0
YEAR_ID                0
PRODUCTLINE            0
MSRP                   0
PRODUCTCODE            0
CUSTOMERNAME           0
PHONE                  0
ADDRESSLINE1           0
ADDRESSLINE2        2521
CITY                   0
STATE               1486
POSTALCODE            76
COUNTRY                0
TERRITORY           1074
CONTACTLASTNAME        0
CONTACTFIRSTNAME       0
DEALSIZE               0
dtype: int64

Since there are lots of null values in ADDRESSLINE2, STATE, POSTALCODE and TERRITORY, we decide to drop these columns. To retain geographical information we can keep COUNTRY, CITY, or both; we choose COUNTRY (CITY is dropped later, just before clustering). In addition, variables such as customer names are dropped since they are not relevant here.

In [93]:
df_drop=['ADDRESSLINE1','ADDRESSLINE2','POSTALCODE','TERRITORY','PHONE','STATE','CONTACTFIRSTNAME','CONTACTLASTNAME','CUSTOMERNAME','ORDERNUMBER']
sales_df=sales_df.drop(df_drop, axis=1)
sales_df.head()
Out[93]:
QUANTITYORDERED PRICEEACH ORDERLINENUMBER SALES ORDERDATE STATUS QTR_ID MONTH_ID YEAR_ID PRODUCTLINE MSRP PRODUCTCODE CITY COUNTRY DEALSIZE
0 30 95.70 2 2871.00 2003-02-24 Shipped 1 2 2003 Motorcycles 95 S10_1678 NYC USA Small
1 34 81.35 5 2765.90 2003-05-07 Shipped 2 5 2003 Motorcycles 95 S10_1678 Reims France Small
2 41 94.74 2 3884.34 2003-07-01 Shipped 3 7 2003 Motorcycles 95 S10_1678 Paris France Medium
3 45 83.26 6 3746.70 2003-08-25 Shipped 3 8 2003 Motorcycles 95 S10_1678 Pasadena USA Medium
4 49 100.00 14 5205.27 2003-10-10 Shipped 4 10 2003 Motorcycles 95 S10_1678 San Francisco USA Medium
In [6]:
sales_df.isnull().sum() # Count of null values remaining in the dataset
Out[6]:
QUANTITYORDERED    0
PRICEEACH          0
ORDERLINENUMBER    0
SALES              0
ORDERDATE          0
STATUS             0
QTR_ID             0
MONTH_ID           0
YEAR_ID            0
PRODUCTLINE        0
MSRP               0
PRODUCTCODE        0
CITY               0
COUNTRY            0
DEALSIZE           0
dtype: int64
In [7]:
# How many unique values of country, product code and product line
sales_df.nunique()
Out[7]:
QUANTITYORDERED      58
PRICEEACH          1016
ORDERLINENUMBER      18
SALES              2763
ORDERDATE           252
STATUS                6
QTR_ID                4
MONTH_ID             12
YEAR_ID               3
PRODUCTLINE           7
MSRP                 80
PRODUCTCODE         109
CITY                 73
COUNTRY              19
DEALSIZE              3
dtype: int64

EXPLORATORY DATA ANALYSIS AND DATA CLEANING

  • Demography of customers
In [34]:
sales_df['COUNTRY'].value_counts().index # Customer's countries
Out[34]:
Index(['USA', 'Spain', 'France', 'Australia', 'UK', 'Italy', 'Finland',
       'Norway', 'Singapore', 'Canada', 'Denmark', 'Germany', 'Sweden',
       'Austria', 'Japan', 'Belgium', 'Switzerland', 'Philippines', 'Ireland'],
      dtype='object')
In [35]:
sales_df['COUNTRY'].value_counts() # Number of customers in each country
Out[35]:
USA            1004
Spain           342
France          314
Australia       185
UK              144
Italy           113
Finland          92
Norway           85
Singapore        79
Canada           70
Denmark          63
Germany          62
Sweden           57
Austria          55
Japan            52
Belgium          33
Switzerland      31
Philippines      26
Ireland          16
Name: COUNTRY, dtype: int64
  • Small function: this function takes a feature name such as COUNTRY as its argument and plots a bar chart of the number of items per category.
In [103]:
def barplot_visualization(x):
  counts=sales_df[x].value_counts()
  fig=px.bar(x=counts.index, y=counts, color=counts.index, height=600)
  fig.show(renderer="notebook")

We call the barplot_visualization function and pass it COUNTRY as an argument to visualise the number of items per country.

In [105]:
barplot_visualization('COUNTRY')
In [106]:
barplot_visualization('CITY') # Visualisation of number of customer per city
  • Our dataset is unbalanced with respect to CITY and COUNTRY: the number of customers from the USA far exceeds the rest. Let's have a look at STATUS.
In [39]:
sales_df['STATUS'].nunique() # number of unique values for STATUS
Out[39]:
6
In [107]:
barplot_visualization('STATUS')
  • Almost all transactions are shipped; the dataset is extremely unbalanced with regard to STATUS.
  • A solution could be to downsample the Shipped class or to upsample the other classes. Here we simply drop the STATUS feature.
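For reference, class downsampling could be sketched as follows (a hypothetical illustration with made-up status counts; we do not apply it in this notebook):

```python
import pandas as pd

# Toy frame with a dominant "Shipped" class (hypothetical counts).
df = pd.DataFrame({
    "STATUS": ["Shipped"] * 90 + ["On Hold"] * 6 + ["Resolved"] * 4,
    "SALES": range(100),
})

minority_size = df["STATUS"].value_counts().min()  # size of the rarest class

# Sample the same number of rows from every class.
balanced = (
    df.groupby("STATUS", group_keys=False)
      .apply(lambda g: g.sample(n=minority_size, random_state=0))
)
print(sorted(balanced["STATUS"].value_counts().tolist()))
```

Each class ends up with as many rows as the rarest one; upsampling would instead sample the minority classes with `replace=True`.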
In [41]:
#drop the status variable
sales_df.drop(columns=['STATUS'], inplace=True)
sales_df.head()
Out[41]:
QUANTITYORDERED PRICEEACH ORDERLINENUMBER SALES ORDERDATE QTR_ID MONTH_ID YEAR_ID PRODUCTLINE MSRP PRODUCTCODE CITY COUNTRY DEALSIZE
0 30 95.70 2 2871.00 2003-02-24 1 2 2003 Motorcycles 95 S10_1678 NYC USA Small
1 34 81.35 5 2765.90 2003-05-07 2 5 2003 Motorcycles 95 S10_1678 Reims France Small
2 41 94.74 2 3884.34 2003-07-01 3 7 2003 Motorcycles 95 S10_1678 Paris France Medium
3 45 83.26 6 3746.70 2003-08-25 3 8 2003 Motorcycles 95 S10_1678 Pasadena USA Medium
4 49 100.00 14 5205.27 2003-10-10 4 10 2003 Motorcycles 95 S10_1678 San Francisco USA Medium
In [42]:
barplot_visualization('PRODUCTLINE')
  • We define another small function that replaces a categorical variable in our dataset with dummy variables
In [43]:
def dummies(x):
  dummy=pd.get_dummies(sales_df[x])
  sales_df.drop(columns=x, inplace=True)
  return pd.concat([sales_df,dummy], axis=1)
  • Applying the above function to add dummy variables for countries in our dataset
In [44]:
sales_df=dummies('COUNTRY')
sales_df.head()
Out[44]:
QUANTITYORDERED PRICEEACH ORDERLINENUMBER SALES ORDERDATE QTR_ID MONTH_ID YEAR_ID PRODUCTLINE MSRP ... Italy Japan Norway Philippines Singapore Spain Sweden Switzerland UK USA
0 30 95.70 2 2871.00 2003-02-24 1 2 2003 Motorcycles 95 ... 0 0 0 0 0 0 0 0 0 1
1 34 81.35 5 2765.90 2003-05-07 2 5 2003 Motorcycles 95 ... 0 0 0 0 0 0 0 0 0 0
2 41 94.74 2 3884.34 2003-07-01 3 7 2003 Motorcycles 95 ... 0 0 0 0 0 0 0 0 0 0
3 45 83.26 6 3746.70 2003-08-25 3 8 2003 Motorcycles 95 ... 0 0 0 0 0 0 0 0 0 1
4 49 100.00 14 5205.27 2003-10-10 4 10 2003 Motorcycles 95 ... 0 0 0 0 0 0 0 0 0 1

5 rows × 32 columns

Obtaining dummies for PRODUCTLINE and DEALSIZE

In [45]:
#Obtain dummies for 'PRODUCTLINE' and 'DEALSIZE'
sales_df=dummies('PRODUCTLINE')
sales_df=dummies('DEALSIZE')

Since there are 109 unique product codes, adding one-hot variables would create 109 additional columns. We can avoid that with categorical (integer) encoding, which helps us avoid the curse of dimensionality.

In [46]:
sales_df['PRODUCTCODE']=pd.Categorical(sales_df['PRODUCTCODE']).codes
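To make the contrast concrete, here is a small sketch (with toy product codes, not the real ones) of one-hot encoding versus integer codes:

```python
import pandas as pd

codes_col = pd.Series(["S10_1678", "S10_1949", "S10_2016", "S10_1678"])

one_hot = pd.get_dummies(codes_col)          # one new column per unique code
int_codes = pd.Categorical(codes_col).codes  # a single integer column

print(one_hot.shape[1], list(int_codes))
```

With 109 real product codes, `get_dummies` would add 109 columns while `.codes` stays a single one.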
  • We then aggregate sales by order date to determine in which month sales generally peak.
In [48]:
# grouping data by order date
sales_df_group=sales_df.groupby(by="ORDERDATE").sum()
sales_df_group
Out[48]:
QUANTITYORDERED PRICEEACH ORDERLINENUMBER SALES QTR_ID MONTH_ID YEAR_ID MSRP PRODUCTCODE Australia ... Classic Cars Motorcycles Planes Ships Trains Trucks and Buses Vintage Cars Large Medium Small
ORDERDATE
2003-01-06 151 288.78 10 12133.25 4 4 8012 363 174.0 0 ... 0 0 0 0 0 0 4 0 2 2
2003-01-09 142 284.96 10 11432.34 4 4 8012 372 181.0 0 ... 0 0 0 0 0 0 4 0 2 2
2003-01-10 80 150.14 3 6864.05 2 2 4006 155 37.0 0 ... 0 0 0 0 0 0 2 0 1 1
2003-01-29 541 1417.54 136 54702.00 16 16 32048 1695 723.0 0 ... 3 0 0 0 0 7 6 0 10 6
2003-01-31 443 1061.89 91 44621.96 13 13 26039 1365 720.0 0 ... 7 0 0 0 2 4 0 0 8 5
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2005-05-13 259 561.18 21 31821.90 12 30 12030 728 101.0 0 ... 0 6 0 0 0 0 0 2 3 1
2005-05-17 509 1269.43 105 59475.10 28 70 28070 1669 462.0 0 ... 14 0 0 0 0 0 0 2 8 4
2005-05-29 607 1148.40 94 51233.18 30 75 30075 1328 797.0 13 ... 7 0 0 0 0 0 8 0 8 7
2005-05-30 187 542.16 18 14578.75 14 35 14035 618 265.0 0 ... 0 0 0 0 0 0 7 0 1 6
2005-05-31 696 1561.40 112 78918.03 38 95 38095 2065 899.0 0 ... 7 0 0 0 0 11 1 3 9 7

252 rows × 38 columns

In [49]:
#using a line plot
fig=px.line(x=sales_df_group.index,y=sales_df_group['SALES'],height=600)
fig.show() # we observe a peak in sales mainly in November
  • Evaluating feature correlations
In [50]:
# correlation matrix between variables
plt.figure(figsize=(20,20))
corr_matrix=sales_df.iloc[:,:10].corr()
sns.heatmap(corr_matrix, annot=True, cbar=False)
Out[50]:
<matplotlib.axes._subplots.AxesSubplot at 0x16c8d7bf808>

It looks like QTR_ID and MONTH_ID are highly correlated, so let's drop QTR_ID. We also observe a strong correlation between MSRP and PRICEEACH, which makes sense because MSRP is a standardized price. PRICEEACH and SALES are highly correlated as well: when PRICEEACH goes up, SALES goes up, since SALES is essentially PRICEEACH multiplied by the quantity ordered.

In [51]:
sales_df.drop("QTR_ID", inplace=True, axis=1)
sales_df.shape
Out[51]:
(2823, 39)

Let's plot distplots: a distplot shows the (1) histogram, (2) KDE plot and (3) rug plot.

  • (2) Kde Plot: Kernel Density Estimate is used for visualizing the Probability Density of a continuous variable.
  • (3) Rug plot: plot of data for a single quantitative variable, displayed as marks along an axis (one-dimensional scatter plot).
In [52]:
import plotly.figure_factory as ff

for i in range(8):
  if sales_df.columns[i] !='ORDERLINENUMBER' and sales_df.columns[i] !='ORDERDATE':
    fig=ff.create_distplot([sales_df[sales_df.columns[i]].apply(lambda x: float(x))], ['distplot'])
    fig.update_layout(title_text=sales_df.columns[i])
    fig.show()
  • QUANTITYORDERED is roughly uniformly distributed
In [53]:
sales_df.drop('ORDERDATE', axis=1, inplace=True)

Let's Visualize the relationship between variables using pairplots.

In [54]:
%matplotlib inline

fig=px.scatter_matrix(sales_df, dimensions=sales_df.columns[:8], color='MONTH_ID')

fig.update_layout(
    title='Sales Data',
    width=1100,
    height=1100,
)
fig.show()
  • A trend exists between 'SALES' and 'QUANTITYORDERED'
  • A trend exists between 'MSRP' and 'PRICEEACH'
  • A trend exists between 'PRICEEACH' and 'SALES'
  • It seems that sales grow as we move from 2003 to 2004 to 2005 ('SALES' vs. 'YEAR_ID')

After performing this exploratory data analysis we move on to clustering our data. We first use the elbow method to determine the optimal number of clusters.

  • The elbow method is a heuristic for interpreting and validating consistency within cluster analysis, designed to help find the appropriate number of clusters in a dataset.
  • The idea is to plot the Within-Cluster Sum of Squares (WCSS) of the distances to the cluster centers against the number of clusters.
  • If the line chart looks like an arm, then the "elbow" of the arm is the best value of k (the number of clusters).

UNDERSTANDING THE INTUITION AND THE THEORY BEHIND KMEANS CLUSTERING


FINDING THE OPTIMAL NUMBER OF CLUSTERS USING ELBOW METHOD

  • The idea is that the Within-Cluster Sum of Squares (WCSS) decreases as the number of clusters increases.
  • At some number of clusters k the WCSS gets low enough and reaches the "elbow" point; that k is optimal
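Formally, for clusters C_1, …, C_k with centroids μ_j, the elbow curve plots the quantity below (this is also what scikit-learn exposes as `inertia_`):

```latex
\mathrm{WCSS}(k) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^{2}
```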

In [55]:
import copy
sales_df_copy=copy.copy(sales_df)
sales_df_copy.drop("CITY", axis=1, inplace=True) # we drop the variable 'CITY' 
  • Before searching for the optimal k, let's scale our data; this is necessary for our model to perform well
In [56]:
scaler=StandardScaler()
sales_df_copy_scaled=scaler.fit_transform(sales_df_copy)
In [57]:
sales_df_copy_scaled.shape
Out[57]:
(2823, 37)

Function to implement the elbow method

In [58]:
def elbow(k,X):

    scores=[]
    range_values=range(1,k)

    for i in range_values:
      kmeans=KMeans(n_clusters=i)
      kmeans.fit(X)
      scores.append(kmeans.inertia_) # inertia is the sum of squared distances of samples to their closest cluster center

    plt.plot(range_values,scores,'bx-') # plot WCSS against the number of clusters
    plt.title('Finding right number of clusters')
    plt.xlabel('clusters')
    plt.ylabel('scores')
    plt.show()
In [60]:
#calling the above function to determine the optimal number of clusters k
elbow(12,sales_df_copy_scaled)
  • Notice the difficulty of determining the optimal number of clusters: the "elbow" is a little tricky to identify from the curve. Maybe k=3 or k=5; not really clear, right?
  • One way to address this is to implement an autoencoder, a dimensionality-reduction model. But let's first take k=5 to examine the clusters.
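Another way to cross-check the choice of k is the silhouette score (imported at the top of the notebook but unused so far). A minimal sketch on synthetic blobs; running it on `sales_df_copy_scaled` instead works the same way:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: 3 well-separated clusters.
X, _ = make_blobs(n_samples=600, centers=3, cluster_std=0.6, random_state=42)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # in [-1, 1]; higher is better

best_k = max(scores, key=scores.get)
print(best_k)
```

Unlike the elbow, the silhouette gives a single number per k, so the best k can be read off directly.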
In [61]:
# Cluster the data using kmeans, k=5
kmeans=KMeans(5)
kmeans.fit(sales_df_copy_scaled)
labels=kmeans.labels_ #labels of clusters
In [53]:
labels
Out[53]:
array([1, 1, 2, ..., 4, 4, 4])
In [54]:
kmeans.cluster_centers_.shape
Out[54]:
(5, 37)
In [62]:
#Converting clusters centers into a dataframe
cluster_centers=pd.DataFrame(data=kmeans.cluster_centers_, columns=sales_df.columns.drop('CITY'))
cluster_centers
Out[62]:
QUANTITYORDERED PRICEEACH ORDERLINENUMBER SALES MONTH_ID YEAR_ID MSRP PRODUCTCODE Australia Austria ... Classic Cars Motorcycles Planes Ships Trains Trucks and Buses Vintage Cars Large Medium Small
0 -0.462749 -0.776728 0.025931 -0.827276 -0.004575 0.004775 -0.586974 0.177730 0.070701 -0.022136 ... -0.093109 0.075134 0.100371 -0.300637 0.095817 0.003370 0.134004 -0.242672 -0.980703 1.096370
1 -0.238120 -0.111504 0.374193 -0.305340 0.129536 -0.099789 -0.205639 0.209124 -0.264819 -0.140961 ... -0.304345 -0.305805 0.592055 -0.095340 0.006298 -0.284337 0.418106 -0.242672 -0.150357 0.262666
2 -0.030694 0.007098 0.068215 -0.270448 -0.044483 0.007269 -0.364269 1.131458 -0.229371 -0.014029 ... -0.721812 -0.364452 -0.348674 3.326275 -0.167454 -0.345470 -0.523371 -0.242672 -0.068249 0.180226
3 0.278220 0.595357 -0.034695 0.482708 0.012332 -0.014401 0.423643 -0.273510 0.013081 0.028796 ... 0.168128 0.025343 -0.054592 -0.300637 -0.038311 0.110657 -0.038922 -0.242672 1.019676 -0.912100
4 1.245428 0.800220 -0.259579 2.573861 -0.088008 0.136857 1.430265 -0.871802 -0.084647 0.043374 ... 0.580097 0.011712 -0.164278 -0.300637 -0.128350 -0.201008 -0.151285 4.120788 -0.980703 -0.912100

5 rows × 37 columns

In [63]:
# In order to understand what these numbers mean, let's perform inverse transformation
cluster_centers = scaler.inverse_transform(cluster_centers)
cluster_centers = pd.DataFrame(data = cluster_centers, columns = sales_df.columns.drop('CITY'))
cluster_centers
Out[63]:
QUANTITYORDERED PRICEEACH ORDERLINENUMBER SALES MONTH_ID YEAR_ID MSRP PRODUCTCODE Australia Austria ... Classic Cars Motorcycles Planes Ships Trains Trucks and Buses Vintage Cars Large Medium Small
0 30.585766 67.991387 6.575730 2030.427838 7.075730 2003.818431 77.130474 59.385949 8.302920e-02 1.642336e-02 ... 2.983577e-01 1.414234e-01 1.395985e-01 -3.053113e-16 4.288321e-02 1.076642e-01 2.700730e-01 7.008283e-16 -1.054712e-15 1.000000e+00
1 32.773585 81.409434 8.047170 2991.593208 7.566038 2003.745283 92.452830 60.377358 1.387779e-17 -3.469447e-17 ... 1.981132e-01 1.886792e-02 2.924528e-01 5.660377e-02 2.830189e-02 1.886792e-02 3.867925e-01 -7.632783e-17 4.150943e-01 5.849057e-01
2 34.793860 83.801711 6.754386 3055.849079 6.929825 2003.820175 86.078947 89.504386 8.771930e-03 1.754386e-02 ... 2.220446e-16 -4.163336e-16 2.775558e-17 1.000000e+00 -1.283695e-16 1.387779e-17 5.551115e-16 -1.942890e-16 4.561404e-01 5.438596e-01
3 37.802589 95.667306 6.319579 4442.814086 7.137540 2003.805016 117.737864 45.135922 6.877023e-02 2.346278e-02 ... 4.223301e-01 1.254045e-01 9.142395e-02 -9.436896e-16 2.103560e-02 1.407767e-01 1.990291e-01 3.469447e-17 1.000000e+00 -7.660539e-15
4 47.222930 99.799554 5.369427 8293.753248 6.770701 2003.910828 158.184713 26.242038 4.458599e-02 2.547771e-02 ... 6.178344e-01 1.210191e-01 5.732484e-02 -1.110223e-16 6.369427e-03 4.458599e-02 1.528662e-01 1.000000e+00 -7.771561e-16 9.992007e-16

5 rows × 37 columns

  • Cluster 4 (highest) - This group represents customers who buy items in high quantity, centered around ~47, leaning towards high-priced items of ~99. They correspond to the highest total sales, around ~8293, and are active throughout the year. They are the biggest buyers of products with high MSRP, ~158.

  • Cluster 3 - This group represents customers who buy items in varying quantity, ~37, and tend to buy high-priced items, ~95. Their sales are a bit above average, ~4442, and they buy products with the second-highest MSRP, ~117.

  • Cluster 0 (lowest) - This group represents customers who buy items in low quantity, ~30, and tend to buy low-priced items, ~68. Their sales, ~2030, are lower than the other clusters' and they are mostly active around the holiday season. They buy products with low MSRP, ~77.

  • Cluster 2 - This group represents customers who are only active during the holidays. They buy in lower quantity, ~34, but tend to buy average-priced items, around ~83. They also correspond to lower total sales, around ~3055, and tend to buy items with MSRP around 86.
  • Cluster 1 - This group represents customers who buy items in varying quantity, ~32, and tend to buy average-priced items, ~81. Their sales are around ~2991.
In [64]:
#Adding the labels on my data(on corresponding samples)
sales_df_cluster=pd.concat([sales_df_copy, pd.DataFrame({'cluster':labels})], axis=1)
sales_df_cluster
Out[64]:
QUANTITYORDERED PRICEEACH ORDERLINENUMBER SALES MONTH_ID YEAR_ID MSRP PRODUCTCODE Australia Austria ... Motorcycles Planes Ships Trains Trucks and Buses Vintage Cars Large Medium Small cluster
0 30 95.70 2 2871.00 2 2003 95 0 0 0 ... 1 0 0 0 0 0 0 0 1 0
1 34 81.35 5 2765.90 5 2003 95 0 0 0 ... 1 0 0 0 0 0 0 0 1 0
2 41 94.74 2 3884.34 7 2003 95 0 0 0 ... 1 0 0 0 0 0 0 1 0 3
3 45 83.26 6 3746.70 8 2003 95 0 0 0 ... 1 0 0 0 0 0 0 1 0 3
4 49 100.00 14 5205.27 10 2003 95 0 0 0 ... 1 0 0 0 0 0 0 1 0 3
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2818 20 100.00 15 2244.40 12 2004 54 108 0 0 ... 0 0 1 0 0 0 0 0 1 2
2819 29 100.00 1 3978.51 1 2005 54 108 0 0 ... 0 0 1 0 0 0 0 1 0 2
2820 43 100.00 4 5417.57 3 2005 54 108 0 0 ... 0 0 1 0 0 0 0 1 0 2
2821 34 62.24 1 2116.16 3 2005 54 108 0 0 ... 0 0 1 0 0 0 0 0 1 2
2822 47 65.52 9 3079.44 5 2005 54 108 0 0 ... 0 0 1 0 0 0 0 1 0 2

2823 rows × 38 columns

A function to plot the histogram for each feature based on clusters

In [65]:
def hist_by_cluster(name_var, data,k):

    '''A function to plot the histogram for each feature based on clusters'''

    for i in name_var:
      plt.figure(figsize=(30,6))
      for j in range(k):
        plt.subplot(1,k,j+1)
        cluster=data[data['cluster']==j]
        cluster[i].hist()
        plt.title('{} \nCluster - {}'.format(i,j))

      plt.show()
In [66]:
#Calling the above function
hist_by_cluster(sales_df_cluster.columns[:8],sales_df_cluster,5)
  • Cluster 1 and cluster 2 are quite similar regarding sales, item price and quantity ordered
  • Cluster 3 and cluster 4 are quite similar regarding the year and month of transactions and the quantity ordered

Reduction and visualization of the original data to 3 dimensional using PCA

In [67]:
#reduction of original data to 3 dimensional using PCA
pca=PCA(n_components=3)
principal_comp=pca.fit_transform(sales_df_copy_scaled)
principal_comp
Out[67]:
array([[-0.48629197, -1.17057435,  0.95994784],
       [-0.7846912 , -1.02282183,  1.38206074],
       [ 1.43145594,  0.13597043,  0.23370255],
       ...,
       [ 0.38983051,  4.08995278, -0.08235796],
       [-2.78896077,  2.36643577,  1.50046211],
       [-0.70551567,  3.28212272, -0.52953725]])
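As a sanity check on such a reduction, `explained_variance_ratio_` reports how much of the total variance the three components retain. A hedged sketch on random stand-in data (the real call on `sales_df_copy_scaled` is identical):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random stand-in for the scaled 37-feature sales data.
rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(500, 37)))

pca = PCA(n_components=3)
pca.fit(X)
retained = pca.explained_variance_ratio_.sum()  # fraction of variance kept by 3 components
print(round(float(retained), 3))
```

A low retained fraction warns that a 3D scatter plot may hide much of the structure, which is one motivation for trying an autoencoder later.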
In [68]:
#Converting the PCA axes into a dataframe
pca_df=pd.DataFrame(data=principal_comp, columns=['pca1','pca2','pca3'])
pca_df.head()
Out[68]:
pca1 pca2 pca3
0 -0.486292 -1.170574 0.959948
1 -0.784691 -1.022822 1.382061
2 1.431456 0.135970 0.233703
3 1.288467 -0.206578 -0.601451
4 1.976043 -0.148603 -1.050492
In [69]:
#concatenate the clusters labels to the dataframe
pca_df=pd.concat([pca_df,pd.DataFrame({'cluster':labels})], axis=1)
pca_df
Out[69]:
pca1 pca2 pca3 cluster
0 -0.486292 -1.170574 0.959948 0
1 -0.784691 -1.022822 1.382061 0
2 1.431456 0.135970 0.233703 3
3 1.288467 -0.206578 -0.601451 3
4 1.976043 -0.148603 -1.050492 3
... ... ... ... ...
2818 -2.450396 1.043438 -1.516478 2
2819 -0.158221 3.702872 0.653198 2
2820 0.389831 4.089953 -0.082358 2
2821 -2.788961 2.366436 1.500462 2
2822 -0.705516 3.282123 -0.529537 2

2823 rows × 4 columns

In [70]:
#Visualizing the clusters using a 3D scatter plot
fig=px.scatter_3d(pca_df, x='pca1',y='pca2', z='pca3',color='cluster', symbol='cluster', size_max=18, opacity=0.7)
fig.update_layout(margin=dict(l=0,r=0,b=0,t=0))

Visualizing the clusters using 2D-Scatterplot: axe1 and axe2

In [71]:
fig=px.scatter(pca_df, x='pca1',y='pca2',color='cluster', symbol='cluster', size_max=18, opacity=0.7)
fig.update_layout(margin=dict(l=0,r=0,b=0,t=0))
  • In the first graph the groups are not very clear.
  • In 2D there appear to be two main groups of customers, with the rest spread between them. We need to address this issue, so next let's apply an autoencoder to obtain clearer clusters.

PERFORM DIMENSIONALITY REDUCTION USING AUTOENCODERS

In [46]:
sales_df_copy.shape
Out[46]:
(2823, 37)
In [72]:
# Defining our autoencoder

# Glorot Uniform initializer: https://keras.rstudio.com/reference/initializer_glorot_uniform.html
input_df=Input(shape=(37,))
x=Dense(50, activation='relu')(input_df)
x=Dense(500, activation='relu', kernel_initializer='glorot_uniform')(x)
x=Dense(500, activation='relu', kernel_initializer='glorot_uniform')(x)
x=Dense(2000, activation='relu', kernel_initializer='glorot_uniform')(x)
encoded=Dense(8, activation='relu',kernel_initializer='glorot_uniform')(x) #my encoder
x=Dense(500, activation='relu', kernel_initializer='glorot_uniform')(encoded)
decoded=Dense(37, activation='relu', kernel_initializer='glorot_uniform')(x) #my decoder

#autoencoder
autoencoder=Model(input_df, decoded)

#encoder - used for dimensionality reduction
encoder=Model(input_df, encoded)

autoencoder.compile(optimizer='adam',loss='mean_squared_error')
In [75]:
#Training our autoencoder
autoencoder.fit(sales_df_copy_scaled, sales_df_copy_scaled, batch_size=128, epochs=500, verbose=0)
Out[75]:
<tensorflow.python.keras.callbacks.History at 0x16c8f8b8c88>
In [76]:
autoencoder.save_weights('autoencoder_1.h5')
In [77]:
pred=encoder.predict(sales_df_copy_scaled)# our dimension-reduced data
In [79]:
#Finding optimal number of clusters using our elbow function applied to our dimension-reduced data
elbow(12,pred)

Now it is clear that the elbow method above gives an optimal number of clusters k=3
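The next cell fits k-means on the original scaled features. A variant that clusters the 8-dimensional encoder output instead would look like this (a sketch; `pred` here is random stand-in data, where the notebook would use `encoder.predict(sales_df_copy_scaled)`):

```python
import numpy as np
from sklearn.cluster import KMeans

# Random stand-in for the encoder's 8-dimensional output.
rng = np.random.default_rng(1)
pred = rng.normal(size=(2823, 8))

kmeans_reduced = KMeans(n_clusters=3, n_init=10, random_state=0).fit(pred)
labels_reduced = kmeans_reduced.labels_
print(labels_reduced.shape, kmeans_reduced.cluster_centers_.shape)
```

The trade-off: clustering the reduced representation often yields tighter clusters, but the centers live in the latent space and are harder to interpret than the 37 original features.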

In [80]:
# Cluster the data using kmeans, k=3
# (k was chosen from the elbow on the encoded data; we fit on the full scaled
#  features so the cluster centers stay interpretable in all 37 dimensions)
kmeans=KMeans(3)
kmeans.fit(sales_df_copy_scaled)
labels=kmeans.labels_
labels=kmeans.labels_
In [84]:
labels
Out[84]:
array([1, 1, 0, ..., 0, 1, 0])
In [85]:
kmeans.cluster_centers_.shape
Out[85]:
(3, 37)
In [81]:
#Converting cluster centers into a dataframe
cluster_centers=pd.DataFrame(data=kmeans.cluster_centers_, columns=sales_df.columns.drop('CITY'))
cluster_centers
Out[81]:
QUANTITYORDERED PRICEEACH ORDERLINENUMBER SALES MONTH_ID YEAR_ID MSRP PRODUCTCODE Australia Austria ... Classic Cars Motorcycles Planes Ships Trains Trucks and Buses Vintage Cars Large Medium Small
0 0.293073 0.580208 -0.010805 0.458608 0.003368 -0.012482 0.376634 -0.165542 -0.013717 0.010642 ... 0.085141 -0.014094 -0.048853 -0.020232 -0.052121 0.064221 -0.064346 -0.242672 1.019676 -0.91210
1 -0.468912 -0.724370 0.043454 -0.810304 0.007142 -0.003285 -0.581757 0.285478 0.025174 -0.016801 ... -0.162957 0.013781 0.072858 0.058660 0.071986 -0.044714 0.087993 -0.242672 -0.980703 1.09637
2 1.245428 0.800220 -0.259579 2.573861 -0.088008 0.136857 1.430265 -0.871802 -0.084647 0.043374 ... 0.580097 0.011712 -0.164278 -0.300637 -0.128350 -0.201008 -0.151285 4.120788 -0.980703 -0.91210

3 rows × 37 columns

In [82]:
# In order to understand what these numbers mean, let's perform inverse transformation
cluster_centers = scaler.inverse_transform(cluster_centers)
cluster_centers = pd.DataFrame(data = cluster_centers, columns = sales_df.columns.drop('CITY'))
cluster_centers
Out[82]:
QUANTITYORDERED PRICEEACH ORDERLINENUMBER SALES MONTH_ID YEAR_ID MSRP PRODUCTCODE Australia Austria ... Classic Cars Motorcycles Planes Ships Trains Trucks and Buses Vintage Cars Large Medium Small
0 37.947254 95.361741 6.420520 4398.433699 7.104769 2003.806358 115.848988 48.545520 0.062139 0.020954 ... 0.382948 0.112717 0.093208 7.731214e-02 0.018786 0.126445 0.188584 -5.273559e-16 1.000000e+00 -8.715251e-15
1 30.525741 69.047496 6.649766 2061.682800 7.118565 2003.812793 77.340094 62.788612 0.071763 0.017161 ... 0.265211 0.121685 0.131045 9.906396e-02 0.039002 0.092824 0.251170 -1.595946e-16 -1.054712e-15 1.000000e+00
2 47.222930 99.799554 5.369427 8293.753248 6.770701 2003.910828 158.184713 26.242038 0.044586 0.025478 ... 0.617834 0.121019 0.057325 -1.110223e-16 0.006369 0.044586 0.152866 1.000000e+00 -7.771561e-16 9.992007e-16

3 rows × 37 columns

  • Cluster 2 (high) - This group represents customers who buy items in high quantity, centered around ~47, leaning towards high-priced items of ~99. They correspond to the highest total sales, around ~8293, and are active throughout the year. They are the biggest buyers of products with high MSRP, ~158.
  • Cluster 0 (middle) - This group represents customers who buy items in fairly high quantity, centered around ~37, at prices around ~95. They correspond to the second-highest total sales, around ~4398, and buy products with MSRP ~115.
  • Cluster 1 (low) - This group has the lowest sales, ~2061, with an average item price around ~69, MSRP around 77 and quantity centered around ~30.
In [83]:
#Adding the labels on my data(on corresponding samples)
sales_df_cluster=pd.concat([sales_df_copy, pd.DataFrame({'cluster':labels})], axis=1)
sales_df_cluster
Out[83]:
QUANTITYORDERED PRICEEACH ORDERLINENUMBER SALES MONTH_ID YEAR_ID MSRP PRODUCTCODE Australia Austria ... Motorcycles Planes Ships Trains Trucks and Buses Vintage Cars Large Medium Small cluster
0 30 95.70 2 2871.00 2 2003 95 0 0 0 ... 1 0 0 0 0 0 0 0 1 1
1 34 81.35 5 2765.90 5 2003 95 0 0 0 ... 1 0 0 0 0 0 0 0 1 1
2 41 94.74 2 3884.34 7 2003 95 0 0 0 ... 1 0 0 0 0 0 0 1 0 0
3 45 83.26 6 3746.70 8 2003 95 0 0 0 ... 1 0 0 0 0 0 0 1 0 0
4 49 100.00 14 5205.27 10 2003 95 0 0 0 ... 1 0 0 0 0 0 0 1 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2818 20 100.00 15 2244.40 12 2004 54 108 0 0 ... 0 0 1 0 0 0 0 0 1 1
2819 29 100.00 1 3978.51 1 2005 54 108 0 0 ... 0 0 1 0 0 0 0 1 0 0
2820 43 100.00 4 5417.57 3 2005 54 108 0 0 ... 0 0 1 0 0 0 0 1 0 0
2821 34 62.24 1 2116.16 3 2005 54 108 0 0 ... 0 0 1 0 0 0 0 0 1 1
2822 47 65.52 9 3079.44 5 2005 54 108 0 0 ... 0 0 1 0 0 0 0 1 0 0

2823 rows × 38 columns

In [84]:
#Plot the histogram for each based on cluster
hist_by_cluster(sales_df_cluster.columns[:8],sales_df_cluster,3)
  • Notice that the three clusters differ clearly along the variables.
    Then let's perform PCA to visualize the data along with the three (3) clusters
In [85]:
pca=PCA(n_components=3)
principal_comp=pca.fit_transform(sales_df_copy_scaled)
principal_comp
Out[85]:
array([[-0.4862578 , -1.1894561 ,  1.01793484],
       [-0.7846195 , -1.03220866,  1.40592376],
       [ 1.43150915,  0.11451493,  0.25769091],
       ...,
       [ 0.38990777,  4.10750587, -0.16748894],
       [-2.78882774,  2.38516665,  1.46083854],
       [-0.70550538,  3.30697404, -0.5308877 ]])
In [86]:
pca_df=pd.DataFrame(data=principal_comp, columns=['pca1','pca2','pca3'])
pca_df.head()
Out[86]:
pca1 pca2 pca3
0 -0.486258 -1.189456 1.017935
1 -0.784619 -1.032209 1.405924
2 1.431509 0.114515 0.257691
3 1.288443 -0.223831 -0.547441
4 1.975992 -0.143618 -0.997281
In [87]:
#concatenate the clusters labels to the dataframe
pca_df=pd.concat([pca_df,pd.DataFrame({'cluster':labels})], axis=1)
pca_df
Out[87]:
pca1 pca2 pca3 cluster
0 -0.486258 -1.189456 1.017935 1
1 -0.784619 -1.032209 1.405924 1
2 1.431509 0.114515 0.257691 0
3 1.288443 -0.223831 -0.547441 0
4 1.975992 -0.143618 -0.997281 0
... ... ... ... ...
2818 -2.450342 1.112941 -1.587333 1
2819 -0.158449 3.734194 0.825893 0
2820 0.389908 4.107506 -0.167489 0
2821 -2.788828 2.385167 1.460839 1
2822 -0.705505 3.306974 -0.530888 0

2823 rows × 4 columns

In [88]:
#Visualizing the clusters using a 3D scatter plot
fig=px.scatter_3d(pca_df, x='pca1',y='pca2', z='pca3',color='cluster', symbol='cluster', size_max=18, opacity=0.7)
fig.update_layout(margin=dict(l=0,r=0,b=0,t=0))
  • Excellent! We now have three clear main clusters.